Abstract

Author: Charles Tapley Hoyt

Estimated Run Time: 1 minute

This notebook demonstrates the utilities in PyBEL Tools that facilitate the exploration and expansion of subgraphs to allow for easier interpretation and contextualization of their underlying mechanisms. The data used in this notebook comes from the AETIONOMY Alzheimer's Disease (AD) knowledge assembly that has been annotated with the NeuroMMSig Knowledge Base.

Notebook Setup


In [1]:
import logging
import os
import sys
import time
from collections import Counter, defaultdict
from operator import itemgetter

import matplotlib.pyplot as plt
import networkx as nx

import pybel
import pybel_tools as pbt
from pybel.constants import *
from pybel_tools.visualization import to_jupyter
from pybel_tools.utils import barh, barv

In [2]:
%config InlineBackend.figure_format = 'svg'
%matplotlib inline

Notebook Provenance

The time of execution and the versions of the software packages used are displayed explicitly.


In [3]:
time.asctime()


Out[3]:
'Mon Aug 14 14:45:31 2017'

In [4]:
pybel.__version__


Out[4]:
'0.7.2'

In [5]:
pbt.__version__


Out[5]:
'0.1.18-dev'

Local Path Definitions

To keep this notebook portable across machines, the locations of the repositories containing its data are read from environment variables, which are set in ~/.bashrc to point to the places where the repositories have been cloned. Assuming the repositories have been cloned with git clone into the ~/dev folder, the entries in ~/.bashrc should look like:

...
export BMS_BASE=~/dev/bms
...

BMS

The biological model store (BMS) is the internal Fraunhofer SCAI repository for keeping BEL models under version control. It can be downloaded from https://tor-2.scai.fraunhofer.de/gf/project/bms/


In [6]:
bms_base = os.environ['BMS_BASE']
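
If BMS_BASE is not set, the lookup above fails with a bare KeyError. A minimal sketch of a more descriptive lookup is shown below; get_repository_path is a hypothetical helper written for this illustration and is not part of PyBEL or PyBEL Tools.

```python
import os


def get_repository_path(variable_name, environ=None):
    """Return the directory stored in an environment variable.

    Hypothetical helper for illustration; not part of PyBEL or PyBEL Tools.
    """
    environ = os.environ if environ is None else environ
    try:
        raw_path = environ[variable_name]
    except KeyError:
        message = '{} is not set; export it in ~/.bashrc first'.format(variable_name)
        raise KeyError(message) from None
    # Expand a leading "~" in case the value was stored without shell expansion
    return os.path.expanduser(raw_path)


# An explicit mapping is passed so the example does not depend on the real environment
example_base = get_repository_path('BMS_BASE', environ={'BMS_BASE': '/home/user/dev/bms'})
print(example_base)
```

Calling get_repository_path('BMS_BASE') without the second argument would read from os.environ, as the cell above does.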

Data

The Alzheimer's Disease Knowledge Assembly has been precompiled to a gpickle file with the following command, and is loaded from that file for improved performance. In general, derived data, such as the gpickle representation of a BEL script, are not saved under version control to ensure that the most up-to-date data is always used.

pybel convert --path "$BMS_BASE/aetionomy/alzheimers/alzheimers.bel" --pickle "$BMS_BASE/aetionomy/alzheimers/alzheimers.gpickle"

The BEL script can also be compiled from inside this notebook with the following Python code:

>>> import os
>>> import pybel
>>> # Input from BEL script
>>> bel_path = os.path.join(bms_base, 'aetionomy', 'alzheimers', 'alzheimers.bel')
>>> graph = pybel.from_path(bel_path)
>>> # Output to gpickle for fast loading later
>>> pickle_path = os.path.join(bms_base, 'aetionomy', 'alzheimers', 'alzheimers.gpickle')
>>> pybel.to_pickle(graph, pickle_path)

In [7]:
pickle_path = os.path.join(bms_base, 'aetionomy', 'alzheimers', 'alzheimers.gpickle')

In [8]:
graph = pybel.from_pickle(pickle_path)

In [9]:
graph.version


Out[9]:
'3.0.9'

In [10]:
# Add all canonical names for later
pbt.mutation.add_canonical_names(graph)

Connecting Components

The GABA Subgraph is explored in this example. This subgraph contains a representative group of genes, RNAs, proteins, biological processes, and pathologies, along with all of their relations. It is extracted with pbt.selection.get_subgraph_by_annotation_value.


In [11]:
example_subgraph_name = 'GABA subgraph'

In [12]:
subgraph = pbt.selection.get_subgraph_by_annotation_value(graph, annotation='Subgraph', value=example_subgraph_name)

pbt.summary.print_summary(subgraph)


Nodes: 65
Edges: 182
Citations: 14
Authors: 69
Network density: 0.04375
Components: 5
Average degree: 2.8
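
The density and average degree printed above are consistent with the usual directed-graph formulas: density = E / (N * (N - 1)) and edges per node = E / N. The sketch below checks this against the reported counts. It is an observation about the printed numbers, not a description of how pbt.summary.print_summary is implemented, and both function names are hypothetical.

```python
def directed_density(number_nodes, number_edges):
    """Density of a directed graph without self-loops: E / (N * (N - 1))."""
    if number_nodes < 2:
        return 0.0
    return number_edges / (number_nodes * (number_nodes - 1))


def edges_per_node(number_nodes, number_edges):
    """Number of edges divided by number of nodes."""
    return number_edges / number_nodes


# Counts reported for the GABA subgraph: 65 nodes and 182 edges
density = directed_density(65, 182)
average = edges_per_node(65, 182)
print(density, average)
```

Because the graph can hold several edges between the same pair of nodes, a density computed this way is a rough indicator of connectivity rather than a strict fraction of possible edges.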

In [13]:
to_jupyter(subgraph)


Out[13]:

The subgraph also contains elements with important unqualified edges, like the relationships between complexes and their members. These relationships can be enriched from the original graph using the function pbt.mutation.enrich_unqualified. For example, the connection between complex(p(HGNC:EGR1), p(HGNC:PSEN2)) and p(HGNC:PSEN2) is added during this process. The connection between p(HGNC:APP) and p(HGNC:APP, frag(672_713)) is also recovered.


In [14]:
pbt.mutation.enrich_unqualified(graph, subgraph)

pbt.summary.print_summary(subgraph)


Nodes: 66
Edges: 189
Citations: 14
Authors: 69
Network density: 0.044055944055944055
Components: 3
Average degree: 2.8636363636363638
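
The idea behind this enrichment can be sketched on a toy edge list. The sketch assumes that an unqualified edge is copied from the full graph whenever at least one of its endpoints is already in the subgraph; this is a simplification for illustration, not the PyBEL Tools implementation, and the EXAMPLE namespace entries are made up.

```python
# BEL relations that carry no citation or evidence ("unqualified" edges)
UNQUALIFIED_RELATIONS = {'hasComponent', 'hasVariant', 'hasMember', 'hasReactant', 'hasProduct'}


def enrich_unqualified_sketch(universe_edges, subgraph_nodes, subgraph_edges):
    """Return copies of the node and edge sets with touching unqualified edges added."""
    original_nodes = frozenset(subgraph_nodes)
    nodes = set(subgraph_nodes)
    edges = set(subgraph_edges)
    for source, target, relation in universe_edges:
        if relation not in UNQUALIFIED_RELATIONS:
            continue
        # Only the original membership is checked, so the result is order independent
        if source in original_nodes or target in original_nodes:
            nodes.update((source, target))
            edges.add((source, target, relation))
    return nodes, edges


universe_edges = [
    ('complex(p(HGNC:EGR1), p(HGNC:PSEN2))', 'p(HGNC:PSEN2)', 'hasComponent'),
    ('p(HGNC:APP)', 'p(HGNC:APP, frag(672_713))', 'hasVariant'),
    ('p(EXAMPLE:X)', 'p(EXAMPLE:Y)', 'hasVariant'),  # touches nothing in the subgraph
    ('p(HGNC:APP)', 'p(EXAMPLE:Z)', 'increases'),  # qualified relation, ignored
]
subgraph_nodes = {'complex(p(HGNC:EGR1), p(HGNC:PSEN2))', 'p(HGNC:APP, frag(672_713))'}

enriched_nodes, enriched_edges = enrich_unqualified_sketch(universe_edges, subgraph_nodes, set())
print(len(enriched_nodes), len(enriched_edges))
```

Nodes can enter the subgraph this way, which is in line with the node count rising from 65 to 66 in the summary above.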

In [15]:
to_jupyter(subgraph)


Out[15]:

The graph also contains some related nodes, such as r(HGNC:GABRA5) and p(HGNC:GABRA5), that are disconnected. Inferring the transcription and translation relationships between genes, RNAs, and proteins allows parts of the graph to be connected even when little other information is available. This can be accomplished with pbt.mutation.infer_central_dogma.


In [16]:
pbt.mutation.infer_central_dogma(subgraph)

pbt.summary.print_summary(subgraph)


Nodes: 138
Edges: 263
Citations: 14
Authors: 69
Network density: 0.013910927747804929
Components: 2
Average degree: 1.9057971014492754
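
A minimal sketch of this kind of inference on toy data follows. It assumes that nodes are (function, namespace, name) tuples, that every protein receives an RNA with a translatedTo edge, and that every RNA receives a gene with a transcribedTo edge. The function name is hypothetical and the real implementation may treat variants and other node types differently.

```python
def infer_central_dogma_sketch(nodes, edges):
    """Return new node and edge sets with RNA and gene nodes inferred."""
    nodes = set(nodes)
    edges = set(edges)
    # Every protein is assumed to be translated from an RNA of the same name
    for function, namespace, name in list(nodes):
        if function == 'Protein':
            rna = ('RNA', namespace, name)
            nodes.add(rna)
            edges.add((rna, (function, namespace, name), 'translatedTo'))
    # Every RNA, including those just added, is assumed to be transcribed from a gene
    for function, namespace, name in list(nodes):
        if function == 'RNA':
            gene = ('Gene', namespace, name)
            nodes.add(gene)
            edges.add((gene, (function, namespace, name), 'transcribedTo'))
    return nodes, edges


# The RNA and protein of GABRA5 start out disconnected, as in the text
protein = ('Protein', 'HGNC', 'GABRA5')
rna = ('RNA', 'HGNC', 'GABRA5')
inferred_nodes, inferred_edges = infer_central_dogma_sketch({protein, rna}, set())

for edge in sorted(inferred_edges):
    print(edge)
```

The number of nodes grows sharply under this scheme because a gene and an RNA are created for every protein, which matches the jump from 66 to 138 nodes in the summary above.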

In [17]:
to_jupyter(subgraph)


Out[17]:

Finally, some of the genes and RNAs that have been added have no connections other than the inferred ones, and can be removed with pbt.mutation.prune_central_dogma.


In [18]:
pbt.mutation.prune_central_dogma(subgraph)

pbt.summary.print_summary(subgraph)


Nodes: 66
Edges: 191
Citations: 14
Authors: 69
Network density: 0.04452214452214452
Components: 2
Average degree: 2.893939393939394
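
The pruning step can be sketched as two passes of leaf removal, first for genes and then for RNAs. The sketch assumes a leaf is a node with no incoming edges and exactly one outgoing edge; the function names and the EXAMPLE namespace are hypothetical, and this is not the PyBEL Tools implementation.

```python
from collections import Counter


def remove_leaves(nodes, edges, function):
    """Drop nodes of the given function that have no in-edges and one out-edge."""
    in_degree = Counter(target for _, target, _ in edges)
    out_degree = Counter(source for source, _, _ in edges)
    leaves = {
        node for node in nodes
        if node[0] == function and in_degree[node] == 0 and out_degree[node] == 1
    }
    kept_edges = {edge for edge in edges if edge[0] not in leaves and edge[1] not in leaves}
    return nodes - leaves, kept_edges


def prune_central_dogma_sketch(nodes, edges):
    """Remove gene leaves first, then the RNA leaves that this exposes."""
    nodes, edges = remove_leaves(set(nodes), set(edges), 'Gene')
    return remove_leaves(nodes, edges, 'RNA')


gene_a, rna_a, protein_a = ('Gene', 'EXAMPLE', 'A'), ('RNA', 'EXAMPLE', 'A'), ('Protein', 'EXAMPLE', 'A')
gene_x, rna_x = ('Gene', 'EXAMPLE', 'X'), ('RNA', 'EXAMPLE', 'X')
protein_b = ('Protein', 'EXAMPLE', 'B')

toy_nodes = {gene_a, rna_a, protein_a, gene_x, rna_x, protein_b}
toy_edges = {
    (gene_a, rna_a, 'transcribedTo'),
    (rna_a, protein_a, 'translatedTo'),
    (protein_a, protein_b, 'increases'),
    (protein_a, rna_x, 'increases'),  # gives the RNA of X a curated in-edge
    (gene_x, rna_x, 'transcribedTo'),
}

pruned_nodes, pruned_edges = prune_central_dogma_sketch(toy_nodes, toy_edges)
print(sorted(pruned_nodes))
```

The RNA of X survives because it takes part in a curated relation, while the chain leading to protein A is removed entirely.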

In [19]:
to_jupyter(subgraph)


Out[19]:

Expansion followed by contraction is analogous to the morphological operations of image processing, where dilation followed by erosion is called "closing" and the reverse order is called "opening". Inferring the central dogma and then removing leaf genes and RNAs is such a standard operation that both steps can be run together with pbt.mutation.opening_on_central_dogma.

Further Consideration

The fact that a subgraph contains more than one connected component probably means that there were errors in the original BEL script. An entire module, pbt.summary.error_summary, is devoted to analyzing the errors produced during compilation.

However, it's also possible that the missing connections are due to a lack of knowledge in the literature. In the curation process for the NeuroMMSig Database, many entity types were not considered. We've developed an algorithm for inferring additional members of a subgraph, including chemicals that occur as intermediates in biochemical processes, and higher level entities such as biological processes. The set of tools for running the algorithm is available in the pbt.mutation.subgraph_expansion submodule (see pbt.mutation.fill_subgraph).
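
To judge which of these two explanations applies, it helps to see exactly which nodes are stranded in the smaller components. A minimal sketch of listing weakly connected components with a breadth-first search is given below; it uses only the standard library, the function name is hypothetical, and the node labels are toy values.

```python
from collections import defaultdict, deque


def weakly_connected_components(nodes, edges):
    """Return components as sets, largest first, ignoring edge direction."""
    neighbors = defaultdict(set)
    for source, target in edges:
        neighbors[source].add(target)
        neighbors[target].add(source)

    unvisited = set(nodes)
    components = []
    while unvisited:
        start = unvisited.pop()
        component = {start}
        queue = deque([start])
        while queue:
            current = queue.popleft()
            for neighbor in neighbors[current]:
                if neighbor in unvisited:
                    unvisited.remove(neighbor)
                    component.add(neighbor)
                    queue.append(neighbor)
        components.append(component)
    return sorted(components, key=len, reverse=True)


toy_nodes = ['a', 'b', 'c', 'd', 'e', 'f']
toy_edges = [('a', 'b'), ('c', 'b'), ('d', 'e')]  # 'f' has no edges at all

components = weakly_connected_components(toy_nodes, toy_edges)
for component in components:
    print(len(component), sorted(component))
```

Small components that consist of a single statement are good candidates for a closer look at the compilation errors, while larger ones are more likely to reflect gaps in curation.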

Expanding on the Periphery

In this example, we'll look at the Estrogen Subgraph. The subgraph is enriched with unqualified edges and opened with the central dogma.


In [20]:
example_subgraph_name = 'Estrogen subgraph'

In [21]:
subgraph = pbt.selection.get_subgraph_by_annotation_value(graph, annotation='Subgraph', value=example_subgraph_name)

pbt.mutation.enrich_unqualified(graph, subgraph)
pbt.mutation.opening_on_central_dogma(subgraph)

pbt.summary.print_summary(subgraph)


Nodes: 25
Edges: 39
Citations: 10
Authors: 50
Network density: 0.065
Components: 5
Average degree: 1.56

In [22]:
to_jupyter(subgraph)


Out[22]:

The nodes along the periphery of this subgraph can be investigated with pbt.mutation.get_subgraph_peripheral_nodes. Below, it is used to list the nodes that aren't already in the Estrogen Subgraph, along with how many in- and out-edges they have to it.


In [23]:
pnd = pbt.mutation.get_subgraph_peripheral_nodes(graph, subgraph, node_filters=pbt.filters.exclude_pathology_filter)

In [24]:
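# Rank peripheral nodes by the number of distinct subgraph nodes they are connected to
for node in sorted(pnd, key=lambda k: len(set(pnd[k]['successor']) | set(pnd[k]['predecessor'])), reverse=True):
    pred_d = pnd[node]['predecessor']
    succ_d = pnd[node]['successor']

    # Skip nodes that lack either predecessor or successor entries
    if not pred_d or not succ_d:
        continue

    periphery = set(pred_d) | set(succ_d)

    # Skip nodes connected to fewer than 4 distinct subgraph nodes
    if len(periphery) < 4:
        continue

    print(node, len(pred_d), len(succ_d), len(periphery))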


('Protein', 'HGNC', 'APP', ('frag', (672, 713))) 7 2 7
('Protein', 'HGNC', 'MAPT', ('pmod', ('bel', 'Glyco'))) 1 5 6
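
The nested dictionary used above can be illustrated with a toy edge list. In this sketch, the 'predecessor' entry of an outside node holds the subgraph nodes that point to it, and the 'successor' entry holds the subgraph nodes it points to. This orientation and the function name are assumptions made for the illustration; the structure returned by PyBEL Tools may differ in detail.

```python
from collections import defaultdict


def new_entry():
    """Create the per-node record of connections to the subgraph."""
    return {'predecessor': defaultdict(list), 'successor': defaultdict(list)}


def peripheral_nodes_sketch(universe_edges, subgraph_nodes):
    """Map each outside node to the subgraph nodes it is connected to."""
    result = defaultdict(new_entry)
    for source, target, relation in universe_edges:
        source_inside = source in subgraph_nodes
        target_inside = target in subgraph_nodes
        if source_inside and not target_inside:
            result[target]['predecessor'][source].append(relation)
        elif target_inside and not source_inside:
            result[source]['successor'][target].append(relation)
    return result


toy_edges = [
    ('a', 'x', 'increases'),
    ('b', 'x', 'decreases'),
    ('x', 'c', 'increases'),
    ('a', 'b', 'increases'),  # both ends inside the subgraph, so not peripheral
    ('y', 'a', 'association'),
]
peripheral = peripheral_nodes_sketch(toy_edges, {'a', 'b', 'c'})

for node in sorted(peripheral):
    entry = peripheral[node]
    print(node, len(entry['predecessor']), len(entry['successor']))
```

Here x is connected to three distinct subgraph nodes in both directions, while y touches only one, so only x would survive filters like those in the loop above.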

The function pbt.mutation.expand_periphery automatically handles these calculations and allows for the specification of a threshold for how "confident" it should be before adding a node to the subgraph. A node filter is used to exclude pathologies, which have many connections to everything. The inferred edges are limited to causal edges, to avoid adding many low-confidence relations. Luckily, the Estrogen Subgraph is small and doesn't become unmanageable after expanding along the periphery. Other, larger subgraphs might have this issue. If the subgraph becomes too complicated, it might be useful to extract the causal subgraph using pbt.selection.get_causal_subgraph.


In [25]:
pbt.mutation.expand_periphery(
    graph, 
    subgraph, 
    node_filters=pbt.filters.exclude_pathology_filter, 
    threshold=3)

pbt.summary.print_summary(subgraph)


Nodes: 33
Edges: 139
Citations: 34
Authors: 181
Network density: 0.13162878787878787
Components: 2
Average degree: 4.212121212121212
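
The role of the threshold can be sketched on toy data. The sketch assumes that an outside node is added when it is connected to at least threshold distinct subgraph nodes, and that all of its connecting edges are added with it. The function name and this exact rule are assumptions, not the PyBEL Tools implementation.

```python
from collections import defaultdict


def expand_periphery_sketch(universe_edges, subgraph_nodes, subgraph_edges, threshold=2):
    """Return node and edge sets expanded with well-connected peripheral nodes."""
    original_nodes = frozenset(subgraph_nodes)
    contacts = defaultdict(set)  # outside node -> distinct subgraph neighbors
    touching = defaultdict(set)  # outside node -> edges linking it to the subgraph

    for edge in universe_edges:
        source, target, _ = edge
        if source in original_nodes and target not in original_nodes:
            contacts[target].add(source)
            touching[target].add(edge)
        elif target in original_nodes and source not in original_nodes:
            contacts[source].add(target)
            touching[source].add(edge)

    nodes = set(original_nodes)
    edges = set(subgraph_edges)
    for node, neighbors in contacts.items():
        if len(neighbors) >= threshold:
            nodes.add(node)
            edges |= touching[node]
    return nodes, edges


toy_edges = [
    ('a', 'x', 'increases'),
    ('x', 'b', 'decreases'),
    ('c', 'x', 'increases'),
    ('a', 'y', 'increases'),  # y touches only one subgraph node
]
expanded_nodes, expanded_edges = expand_periphery_sketch(toy_edges, {'a', 'b', 'c'}, set(), threshold=3)
print(sorted(expanded_nodes), len(expanded_edges))
```

Raising the threshold admits fewer nodes. Edges between two newly added nodes are not picked up by this sketch, since only edges touching the original subgraph are considered.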

In [26]:
to_jupyter(subgraph)


Out[26]:

In this case, we were able to infer connections that not only gave the Estrogen Subgraph more context, but also connected the individual components.

Conclusions

A final touch to the subgraph might be to infer connections between nodes that have just been added. This can be done with pbt.mutation.expand_internal, which allows for the specification of edge filters, or with pbt.mutation.expand_internal_causal, a thin wrapper that supplies the edge filter pbt.filters.keep_causal_edges. Again, expansion on unqualified edges and opening with the central dogma can make this expanded subgraph easier to interpret.
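
A minimal sketch of internal expansion on toy data follows. It assumes that every edge of the full graph whose endpoints are both in the subgraph is added if it passes an optional edge filter. The function names are hypothetical, and the behavior of the PyBEL Tools functions may differ, for example in how existing edges are treated.

```python
# The four causal relations of BEL
CAUSAL_RELATIONS = {'increases', 'directlyIncreases', 'decreases', 'directlyDecreases'}


def keep_causal(edge):
    """Edge filter that passes only causal relations."""
    return edge[2] in CAUSAL_RELATIONS


def expand_internal_sketch(universe_edges, subgraph_nodes, subgraph_edges, edge_filter=None):
    """Return the subgraph edges plus passing universe edges between subgraph nodes."""
    expanded = set(subgraph_edges)
    for edge in universe_edges:
        source, target, _ = edge
        if source not in subgraph_nodes or target not in subgraph_nodes:
            continue
        if edge_filter is None or edge_filter(edge):
            expanded.add(edge)
    return expanded


toy_universe = [
    ('a', 'b', 'increases'),
    ('b', 'c', 'association'),  # internal but not causal
    ('c', 'a', 'directlyDecreases'),
    ('a', 'z', 'increases'),  # z is outside the subgraph
]
toy_nodes = {'a', 'b', 'c'}

all_internal = expand_internal_sketch(toy_universe, toy_nodes, set())
causal_internal = expand_internal_sketch(toy_universe, toy_nodes, set(), edge_filter=keep_causal)
print(len(all_internal), len(causal_internal))
```

Passing the causal filter plays the same role in this sketch as the thin wrapper described above.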

Further, an unbiased expansion method could allow entities such as chemicals and biological processes to be annotated to subgraphs, and could allow more exotic enrichment algorithms, similar to those of NeuroMMSigDB, to be implemented.